The need for automated speech recognition has grown with the industrial expansion of automation
and human-machine interface applications. Speech impairment caused by communication disorders,
neurogenic speech disorders, or psychological speech disorders limits the performance of
artificial intelligence-based systems. Dysarthria is a neurogenic speech disorder that restricts a
speaker's capacity to articulate. This article presents a comprehensive survey of recent advances
in automatic dysarthric speech recognition (DSR) using machine learning (ML) and deep learning
(DL) paradigms. It focuses on the methodology, database, evaluation metrics, and major findings
of previous approaches. From the literature survey, it identifies the gaps in existing work on DSR
and provides future directions for improving DSR. The performance of various ML and DL schemes
is evaluated for DSR on the UASpeech dataset based on accuracy, precision, recall, and F1-score.
It is observed that the DL based DSR schemes outperform the ML based DSR schemes.
1. INTRODUCTION


Dysarthria is a speech disorder caused by weakness in the speech production muscles or by an
individual's inability to control them. It frequently causes slow or slurred speech that is difficult to
understand. Dysarthria can be caused by a neural disorder, throat or tongue muscle weakness, or facial
paralysis [1], [2]. The muscles used for speech production are controlled by the brain and nervous system.
Dysarthria is mostly caused by damage to these muscles. Dysarthria is grouped into developmental and
acquired dysarthria. Developmental dysarthria, normally found in children, occurs due to brain damage
before or during birth. Acquired dysarthria generally occurs due to brain damage later in life, from causes
such as brain tumors, stroke, head injury, motor neuron disease, or Parkinson's disease [3]-[5].

The term "dysarthria" refers to a variety of neurological speech abnormalities caused by injury to the
central or peripheral nervous system. Reduced stress, slow speech rate, hypernasality, muscular stiffness,
spasticity, monopitch, and a limited range of speech motions are all signs of dysarthric speech. It can affect
the subglottal, laryngeal, and articulatory systems, which can make speech production difficult. Stroke,
Parkinson's disease, and cerebral palsy are the most common causes of motor speech difficulties. Improving
human-machine interaction for persons with dysarthria is reported to be increasingly important for
boosting overall wellness and independence. Physical impairments are common in people with dysarthria,
making common input methods (typing and touch screens) difficult to use [6], [7].

Traditionally, a speech and language therapist diagnoses dysarthria by asking people to read passages
aloud, recite numbers or weekdays, make various sounds, or talk about a familiar topic. The performance of
these traditional techniques is limited by factors such as inadequate expert knowledge, tiredness, and
fatigue. Dysarthria may affect phonation, breathing, prosody, articulation, resonance, and lip movement. It
shows a large variation in speech intelligibility; the range of intelligibility is wide and may depend on the
extent of nervous system damage. The typical symptoms of dysarthria are listed in Figure 1. Because of
articulatory difficulties, there is no uniformity in articulation. Pronunciation changes and speaking pace
slows as a result of exhaustion. All of these characteristics impair the dysarthric speaker's intelligibility (the
degree to which others can understand their speech) and limit verbal interactions, reducing their quality of
life [8], [9].


[Figure 1 lists: slurred, breathy, or nasal-sounding speech; very quiet or loud speech; monotonous speech;
hoarse or strained voice; inability to whisper; difficulty in lip and tongue movement; constant drooling due
to difficulty in swallowing; breathing problems; resonance; phonation; Wilson's disease; cerebral palsy;
Lyme disease; myasthenia gravis.]

Figure 1. Typical symptoms of dysarthria


The classification system helps to narrow down the dimensions of the perceptual analysis of dysarthric
speech. The classification of dysarthric speech is given in Figure 2. Most clinicians find it useful for
correcting or reducing the deficits found in dysarthric speech production. Normal speakers typically
communicate at rates between 150 and 200 words per minute, with speech that is clear, timely, and
contextually relevant. Speakers with severe impairments communicate at fewer than 15 words per minute.
This reduction in the rate of communication affects both its quantity and its quality. People with dysarthria
are often physically challenged, so it is difficult for them to handle conventional keyboard or mouse
interfaces. Dysarthric speakers also have difficulty contributing enough samples of speech data. Some
dysarthric speakers tire quickly, which may lead to distress. They often fail to utter certain sounds,
which results in phonetic variation [10], [11].


[Figure 2 shows five types of dysarthria (flaccid, spastic, ataxic, hypokinetic, and hyperkinetic) with
features including resonatory incompetence, phonatory incompetence, phonatory stenosis,
phonatory-prosodic insufficiency, articulatory-resonatory incompetence, articulatory inaccuracy, prosodic
excess, and prosodic insufficiency.]

Figure 2. Types and features of dysarthria

The generalized process of dysarthric speech recognition (DSR), shown in Figure 3, encompasses
pre-processing, feature representation, classification, and recognition. The pre-processing phase deals with
the primary processing of the dysarthric speech to improve the quality of the features and the performance
of the classifiers. It encompasses framing, cropping, speech separation, noise suppression, windowing,
normalization, speech enhancement, and data augmentation. Dysarthric speech contains reverberation,
silent regions, stops, and wide variation in pitch and signal energy, which motivates the use of speech
enhancement to improve DSR effectiveness. Feature extraction is an important phase that collects the
distinctive characteristics of normal and dysarthric speech. The features are generally grouped into
spectral, prosodic, voice quality, and Teager energy operator (TEO) features. Traditional machine learning
(ML) based DSR performs feature extraction followed by classification, whereas in deep learning (DL) a
separate feature extraction step may not be needed, because DL techniques often combine hidden feature
extraction layers with a classification layer. However, many hybrid DL algorithms use traditional features
as input to boost speech intelligibility, feature representation, and DSR accuracy.
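
As an illustration of the framing, windowing, and normalization steps named above, the following is a minimal sketch in plain Python. The frame length, hop length, Hamming window, and peak normalization are assumptions chosen for the example; the surveyed systems do not share a single configuration, and the function names are hypothetical.

```python
import math


def frame_signal(samples, frame_len, hop_len):
    """Split a 1-D sequence into overlapping frames, zero-padding the last one."""
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start < len(samples):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_len >= len(samples):
            break
        start += hop_len
    return frames


def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def window_frames(frames):
    """Multiply every frame by a Hamming window of matching length."""
    w = hamming(len(frames[0]))
    return [[s * wi for s, wi in zip(frame, w)] for frame in frames]


def peak_normalize(samples):
    """Scale the signal so that its largest absolute value is 1."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)
```

A typical call order would be `window_frames(frame_signal(peak_normalize(x), frame_len, hop_len))`, with the windowed frames then passed to a feature extractor.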


[Figure 3 shows four stages: pre-processing (windowing, noise reduction, normalization, voice activity
detection, enhancement, data augmentation); feature extraction, grouped into prosodic features (pitch,
length, energy, formants), spectral features (MFCC, LPCC, GFCC), voice quality features (jitter, shimmer,
HNR, normalized amplitude quotient, quasi open quotient), and Teager energy operator (TEO) features
(TEO-FM-Var, TEO-Auto-Env, TEO-CB-Auto-Env); classification (CNN, DNN, RNN, LSTM, AE, TL); and
speech recognition.]

Figure 3. Generalized process of DSR


Various DSR strategies have been presented in the last two decades. This section gives a quick
overview of recent DSR approaches. Voice tremor has been quantified using phonation parameters that
characterize disordered voice, such as jitter and fundamental frequency [9], [12]. To avoid the dependence
of these parameters on gender and acoustic environment, a pitch period entropy-based evaluation was
developed [13]. Hypophonia has also been described using energy fluctuation and short-time energy [14].
The Teager-Kaiser energy operator, which provides a measure of speech intensity, is utilized to adjust for
signal frequency [15]. To explore the influence on articulatory dynamics and speech intelligibility, acoustic
cues based on the first three formants and their respective bandwidths can be studied [16]. Vowel space
area (VSA) has been investigated for assessing speech intelligibility [17]. A support vector machine (SVM)
classifier was used to distinguish dysarthric speech from healthy speech using a collection of glottal and
openSMILE features [18]. Gurugubelli and Vuppala [19] investigated analytic phase features derived from
voice signals using the single frequency filtering (SFF) approach. Audio descriptors used for determining
musical instrument timbre were combined with an artificial neural network (ANN) model to classify
dysarthric speech severity levels [20]. For dysarthria classification, multi-tapered spectral estimation was
used to extract the audio descriptor features.
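
Two of the phonation measures mentioned above can be written down compactly. The sketch below uses the standard discrete Teager-Kaiser energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1], and a common "local jitter" definition (mean absolute difference of consecutive pitch periods divided by the mean period). These are textbook forms assumed for illustration; the cited studies may use different variants, and the function names are hypothetical.

```python
import math


def teager_kaiser(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def expected_tone_energy(amplitude, omega):
    """Closed-form operator output for A*cos(omega*n + phi): A^2 * sin(omega)^2."""
    return (amplitude * math.sin(omega)) ** 2


def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods over the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))
```

For a pure tone the operator output is constant and grows with both amplitude and frequency, which is why it is described as an intensity measure that accounts for signal frequency.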

Johnson et al. [21] evaluated recognition performance for dysarthric speech with automatic speech
recognition (ASR) systems based on Gaussian mixture model (GMM) hidden Markov models (HMMs) and
SVMs [22]. The experimental results showed that the HMM-based model may provide robustness against
large-scale word-length variances, while the SVM-based model can alleviate the effect of deleted or reduced
consonants. Rudzicz [23] investigated acoustic models based on GMM-HMM, conditional random fields,
SVMs, and ANNs [24]. The results showed that the ANNs provided higher accuracy than the other models.
Revathi et al. [25] presented multiple features, such as gammatone energy (GFE), modified group delay
function cepstrum (MGDFC), and Stockwell features, for isolated DSR. The method used decision level
fusion with a vector quantization (VQ) classifier and a speech enhancement scheme to minimize distortions
and improve speech intelligibility. It resulted in a word error rate (WER) of 4% for dysarthric subjects with
6% intelligibility. Qatab and Mustafa [26] used spectral, cepstral, voice quality, prosodic, and overall speech
features along with SVM, ANN, linear discriminant analysis (LDA), classification and regression tree
(CART), Naive Bayes (NB), and random forest (RF) classifiers for DSR. Seven feature selection algorithms
were applied to select the dominant features: conditional information feature extraction (CIFE), double
input symmetrical relevance (DISR), interaction capping (ICAP), conditional mutual information
maximization (CMIM), conditional redundancy (Condred), joint mutual information (JMI), and relief. It
provided an average ranking score of 4.88 for RF with relief feature selection. Janbakhshi et al. [27]
presented singular value decomposition (SVD) for the spectro-temporal representation of dysarthric speech
and temporal Grassmann discriminant analysis (T-GDA) for DSR. It outperformed the traditional mel
frequency cepstral coefficient (MFCC)-SVM based DSR. The subspace-based learning shows superior
discrimination between normal and dysarthric speech, and the temporal subspace gives better performance
than the spectral subspace.

Recently, DL technology has been widely used in many voice-based automation systems and has been
shown to provide better performance than conventional ML based methods [28], [29]. Fathima et al. [30]
applied a multilingual time delay neural network (TDNN) system that combined acoustic modeling and
language-specific information to increase ASR performance. The experimental results showed that the
TDNN-based ASR system achieved suitable performance, with a WER of 16.07%. Yue et al. [31] investigated
a convolutional and light gated recurrent unit (LiGRU) based multi-spectra acoustic model for DSR. It used
speed perturbation for data augmentation to reduce the data scarcity problem, giving a WER of 11% and
40.6% for normal and dysarthric speech, respectively. Yue et al. [32] developed a multi-stream acoustic
model based on a convolutional neural network (CNN), LiGRU, and fully connected multi-layer perceptron
(MLP) with an optimal fusion technique for DSR. The proposed model provided a WER of 4.6% for data
pre-processed using electromagnetic articulography (EMA). The EMA pre-processing includes a
Butterworth filter for minimizing measurement noise and down-sampling for synchronization with the
MFCC features.
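
Since most of the results above are reported as WER, a short sketch of how the metric is usually computed may help: the word-level edit distance (substitutions, deletions, insertions) between reference and recognized transcripts, divided by the number of reference words. Whitespace tokenization and the function name are assumptions for illustration; the cited papers rely on their own toolkits' scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the recognizer outputs many more words than were spoken.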

Data efficiency is a major obstacle in DSR. Soleymanpour et al. [33] proposed a text to speech (TTS)
synthesizer based on the FastSpeech model for data augmentation. The augmented data was provided to a
deep neural network (DNN)-HMM with a light bidirectional GRU, which gave a WER improvement of
12.2% over the baseline model. Traditional data augmentation approaches mainly focus on the temporal
variations of the signal, while the spectral envelope remains the same. Liu et al. [34] presented vocal tract
length perturbation (VTLP), tempo perturbation, and speed perturbation for data augmentation,
concentrating on temporal as well as spectral transformations of the dysarthric speech signal. The DNN and
neural architecture search (NAS) based DSR provides a WER of 25.21% and 5.4% for the UASpeech and
CUHK datasets, respectively. Shahamiri [35] used a voicegram to capture the correlation between
phonemes and dysarthric speech. A visual data augmentation model is used to reduce the data scarcity
problem in DSR. The spatial convolutional neural network (S-CNN) provides an accuracy of 67% on the
UASpeech dataset. The S-CNN sometimes suffers from the vanishing gradient problem and provides poor
results for moderate dysarthria. The intelligibility of the speech is strongly affected by the time domain
variance of dysarthric speech and by background noise. Lin et al. [36] suggested that DL based voice
conversion (DVC) using a phonetic posteriorgram (PPG) provides stable performance compared with
DVC-mel under noisy conditions.
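
To make the augmentation idea concrete, the sketch below implements speed perturbation as plain resampling with linear interpolation. Resampling by a factor changes both the duration and the pitch of the signal, whereas tempo perturbation would change duration only. The interpolation scheme and the factors 0.9, 1.0, and 1.1 are assumptions chosen for illustration, not the settings of [31] or [34].

```python
def speed_perturb(samples, factor):
    """Resample a signal so it plays `factor` times faster (linear interpolation)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * factor, last)   # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Return one perturbed copy of the utterance per speed factor."""
    return [speed_perturb(samples, f) for f in factors]
```

Each original utterance then contributes several training examples of slightly different length, which is the mechanism by which such augmentation eases data scarcity.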

Kodrasi and Bourlard [37] suggested that spectro-temporal sparsity measured using the Gini index
provided better performance than shimmer, jitter, fundamental frequency, harmonics to noise ratio (HNR),
and MFCC for DSR. Spectral sparsity was observed to perform better than temporal sparsity. Kodrasi [38]
used a CNN for learning the temporal and spectral characteristics obtained using temporal envelope and
fine structure (TEFS). The TEFS outperformed the traditional short-time Fourier transform (STFT) based
speech spectrogram: the TEFS-CNN provides 85.72% accuracy for DSR, whereas the STFT-CNN provides
69.76%. Chandrashekar et al. [39] investigated a time-frequency CNN for capturing the temporal as well as
spectral properties of dysarthric speech. The spectro-temporal properties of the speech signals are obtained
using STFT, spectrograms using SFF, and the constant Q-transform (CQT). The DSR accuracy was higher for
female subjects than for male subjects. The training data deficiency resulted in a class imbalance problem.
The time-frequency based CNN provides a better representation of the spectro-temporal variation of
dysarthric speech, which has shown a significant improvement in DSR accuracy over traditional ANNs [40].
Fritsch and Doss [41] presented a recurrent neural network (RNN) based binary classifier and a CNN based
multi-feature classifier. It provided high correlation for synthesized speech generated using TTS.
Table 1 provides a summary of various DSR techniques based on ML and DL approaches.

Table 1. Summary of ML and DL based DSR

| Ref. | Speech enhancement | Data augmentation | Feature extraction | Classifier | Database | Performance metrics | Remark |
|---|---|---|---|---|---|---|---|
| [31] | Cepstral processing to separate filter and speech element | Speed perturbation | CNN-LiGRU | Softmax | TORGO | WER = 40.6% (dysarthric), 11% (normal) | Combination of excitation and vocal tract component can be used for speaking style modelling |
| [32] | EMA | - | CNN-LiGRU-FCMLP | Softmax | TORGO | WER = 4.6% | Over-fitting problem for high level articulatory feature fusion |
| [33] | - | TTS | DNN-HMM-BLiGRU | Softmax | TORGO | WER = 41.6% | The severity of dysarthric speech depends upon energy, duration, and pitch of the signal |
| [34] | - | VTLP, tempo perturbation, and speed perturbation | Model based speaker adaptation and cross-domain generation of visual features | DNN-NAS | UASpeech; Chinese University of Hong Kong (CUHK) | WER = 25.21% (UASpeech), WER = 5.4% (CUHK) | High WER for low intelligibility speakers |
| [35] | - | Visual data augmentation | Voicegram | S-CNN | UASpeech | Accuracy = 67% | Provides less temporal representation of speech; may cause vanishing gradient problem |
| [36] | - | - | - | CNN with a PPG | 10 samples of 19 Chinese commands for 3 users | CNN-PPG 93.49%, CNN-MFCC 65.67%, ASR based system 89.59% | Class imbalance problem due to uneven dataset size |
| [37] | - | - | Spectro-temporal sparsity using the Gini index | SVM | Spanish database (PC-GITA database) | Accuracy = 83.30% (GST), 76.7% (MFCC), 60% (HNR), 57% (shimmer), 52% (jitter), 54.40% (F0) | Lower recognition rate due to a small number of features; not suitable for larger datasets |
| [38] | - | - | TEFS | CNN | PC-GITA database | Accuracy = 85.75, AUC = 0.93 | Less feature discrimination due to higher intra-class and lower inter-class variability; cannot handle complex auditory models |
| [39] | - | - | STFT, spectrograms using SFF, CQT | Time-frequency CNN | Universal Access and TORGO | Accuracy = 98.00% (female), 95.80% (male) | Class imbalance problem; complexity of network; high computation time |
| [26] | - | - | Spectral, cepstral, voice quality, prosodic, overall speech features | LDA, CART, NB, ANN, SVM, and RF | NEMOURS database | Average ranking score for RF and relief feature selection (4.88) | Ability to classify speech based on severity level; feature selection is important for DSR; not applicable for larger datasets; lower performance than DL approaches |
| [27] | - | - | SVD | T-GDA | PC-GITA, MoSpeeDi, UASpeech | Accuracy = 82.0±3.5% (PC-GITA), 80.5±4.7% (MoSpeeDi), 96.30% (UA) | Temporal subspaces provide better representation of normal and dysarthric speech compared with spectral subspaces |
| [41] | - | - | Pearson's correlation coefficient and Spearman's correlation coefficient | RNN | UASpeech database | PCC (0.950), SCC (0.957) | Provides high correlation for synthesized speech generated using TTS |
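
The Gini-index sparsity measure listed for [37] can be illustrated with a commonly used definition for a vector of magnitudes sorted in ascending order, G = 1 - 2 * sum_k (c_(k) / ||c||_1) * ((N - k + 1/2) / N). The sketch below assumes this definition; the exact formulation and the spectro-temporal representation used in [37] are not given here, so this should be read only as an example of the general idea.

```python
def gini_index(values):
    """Gini-index sparsity of a vector: 0 for uniform, near 1 for very sparse."""
    mags = sorted(abs(v) for v in values)  # ascending magnitudes c_(1) <= ... <= c_(N)
    n = len(mags)
    total = sum(mags)
    if n == 0 or total == 0:
        raise ValueError("need a non-empty vector with non-zero energy")
    acc = sum((m / total) * ((n - k + 0.5) / n)
              for k, m in enumerate(mags, start=1))
    return 1.0 - 2.0 * acc
```

Applied to each frame of a spectrogram it gives a spectral sparsity value per frame; applied along time for each frequency bin it gives a temporal sparsity value per bin.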

This paper presents a comprehensive survey of distinct ML-based and DL-based DSR systems. It
focuses on the DSR methodology, which comprises enhancement, data augmentation, feature extraction,
feature selection, and classification techniques. It analyses the datasets, experimental results, and
performance metrics to depict the merits, demerits, and challenges of present DSR systems. Additionally,
the performance of various ML and DL based DSR schemes is evaluated on the UASpeech dataset, and the
results are analyzed using accuracy, recall, precision, and F1-score. The rest of the paper is structured as
follows: section 1 has described the generalized process of automatic DSR and given a succinct survey of
recent ML and DL based DSR systems, section 2 elaborates the detailed description of the method, section 3
gives detailed results and findings, and section 4 concludes the paper and outlines the scope for future
enhancement.


2. RESEARCH METHOD 

The process of the proposed analysis of different feature extraction and classification techniques for
DSR is illustrated in Figure 4. The proposed system uses pre-emphasis filtering, which uses a moving
average filter to minimize noise and normalize the speech. It diminishes the irregularities present in the
speech signal.
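
The filter coefficients are not stated in the text. As a point of reference, the sketch below shows the widely used first-order pre-emphasis filter, y[n] = x[n] - alpha * x[n-1], which is a two-tap FIR filter. The value alpha = 0.97 is a conventional choice assumed here, and the function name is hypothetical; it is not necessarily the filter used in this study.

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1], with y[0] = x[0]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

With alpha close to 1 the filter strongly attenuates slowly varying components and leaves rapid changes largely intact, which flattens the spectral tilt of speech before feature extraction.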


[Figure 4 shows the pipeline: UASpeech dataset (70% training data, 30% testing data); speech enhancement
(pre-emphasis); feature extraction; classification, including the DL classifiers DNN, DCNN, LSTM, and
DCNN+LSTM; dysarthric speech recognition.]

Figure 4. Proposed research method of DSR


The proposed system accepts speech samples from the UASpeech dataset. The samples are cropped or
padded to a 10 second duration to make all data uniform. Of the total UASpeech data, 70% of the samples
are used for training and 30% for testing. The feature extraction stage computes different features using
traditional algorithms: MFCC (13 MFCC features, 13 delta features, and 13 delta-delta features), perceptual
linear prediction (PLP) features, linear predictive coding (LPC) features (13 features), wavelet packet
transform (WPT) features (3 levels), relative spectra (RASTA) features, and CQT cepstrogram features.
These features are used to train various ML classifiers for dysarthric voice recognition: dynamic time
warping (DTW), K-nearest neighbour (KNN), SVM, NB, LDA, feedforward neural network (FFNN), and
learning vector quantization (LVQ). For the two-dimensional DL algorithms, the system considers the
spectrogram representation of the signal. It utilizes the DNN, deep convolutional neural network (DCNN),
long short-term memory (LSTM), and DCNN-LSTM for the analysis of DSR on one-dimensional and
two-dimensional speech signals. The performance of the proposed system is evaluated based on DSR
accuracy.
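
The data preparation described above can be sketched as three small steps: fixing every utterance to a common duration, splitting 70/30, and stacking 13 static coefficients with their delta and delta-delta into 39 values per frame. Zero padding, a seeded random shuffle, and the regression-style delta with a window of two frames are assumptions made for this example; the text does not specify them, and the function names are hypothetical.

```python
import random


def fix_length(samples, sample_rate, seconds=10.0):
    """Crop or zero-pad a signal to exactly `seconds` of audio."""
    target = int(round(sample_rate * seconds))
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle reproducibly and split into training and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def delta(frames, width=2):
    """Regression delta: sum_n n*(c[t+n] - c[t-n]) / (2*sum_n n^2), edges repeated."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            num = sum(n * (frames[min(t + n, last)][d] - frames[max(t - n, 0)][d])
                      for n in range(1, width + 1))
            row.append(num / denom)
        out.append(row)
    return out


def stack_with_deltas(frames):
    """Concatenate static, delta, and delta-delta features for every frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]
```

Given 13 static MFCCs per frame, `stack_with_deltas` returns the 39-dimensional vectors referred to in the text.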


3. RESULTS AND DISCUSSION 

This section provides the experimental results of the various ML and DL based schemes for DSR. The
features considered are MFCC, PLP, LPC, WPT (3 levels), RASTA, and CQT, and they are used to train the ML
classifiers DTW, KNN, SVM, NB, LDA, FFNN, and LVQ. The UASpeech dataset is used for the experiments,
with results given in Table 2. MFCC+SVM provides the highest accuracy for DSR, 83.26%, compared with
the other classifiers (DTW, KNN, NB, LDA, FFNN, and LVQ). The MFCC spectrogram is observed to provide
better spectral characteristics of the dysarthric speech signal, which helps to capture the changes that
dysarthria causes in the speech. The experiments are carried out on UASpeech data cropped to a 5 sec
duration. A total of 1,000 samples of normal and dysarthric speech are considered for the evaluation.


Table 2. Performance of ML based DSR (accuracy, %)

| Feature extraction technique | DTW | KNN | SVM | NB | LDA | FFNN | LVQ |
|---|---|---|---|---|---|---|---|
| PLP | 54.79 | 60.63 | 62.00 | 57.38 | 54.79 | 52.78 | 55.00 |
| RASTA | 62.09 | 63.83 | 65.26 | 51.33 | 62.09 | 56.65 | 57.50 |
| LPC | 46.23 | 59.09 | 63.00 | 45.33 | 46.23 | 43.45 | 56.45 |
| WPT | 46.34 | 68.00 | 72.50 | 69.56 | 62.23 | 60.86 | 53.56 |
| CQT | 65.87 | 72.35 | 78.00 | 71.45 | 68.54 | 63.50 | 59.00 |
| MFCC | 62.23 | 75.54 | 83.26 | 73.35 | 67.00 | 64.00 | 61.23 |
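
For reference, the evaluation measures named in this paper (accuracy, precision, recall, and F1-score) follow from the confusion counts of a two-class decision. The sketch below treats "dysarthric" as the positive class, labeled 1; this labeling and the function name are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score from true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Reporting precision and recall alongside accuracy matters when the normal and dysarthric classes are unevenly represented, since accuracy alone can hide poor performance on the smaller class.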


Various DL based DSR schemes, namely DNN, DCNN, LSTM, and DCNN-LSTM, are evaluated for DSR on
the UASpeech dataset, as given in Figure 5. A five-layer 1-D DNN gives 85% accuracy for raw speech and
87.5% accuracy for 39 MFCC coefficients, which comprise 13 MFCC coefficients, 13 delta coefficients, and
13 delta-delta coefficients and represent the spectral variation over the frames of the speech. It provides
89.45% and 90.56% accuracy for the 2-D representation of the speech signal using the CQT and MFCC
spectrograms, respectively. The 2-D representation provides a better spectral and spatial representation of
the speech signal and helps to improve the accuracy over the 1-D representation. Further, a five-layer
DCNN is used, in which every layer comprises convolution, batch normalization, and maximum pooling. It
uses 32, 64, 96, 128, and 256 filters for the first to fifth layers. The DCNN gives 86.60% accuracy for raw
speech and 88.80% accuracy for 39 MFCC coefficients. It provides 90.10% and 91% accuracy for the CQT
and MFCC spectrograms, respectively. Afterward, an LSTM with five layers is employed to represent the
temporal characteristics of the dysarthric signal, which gives 85%, 86.20%, 87%, and 88.50% accuracy for
raw speech+LSTM, MFCC coefficients+LSTM, CQT spectrogram+LSTM, and MFCC spectrogram+LSTM,
respectively. The DCNN achieves the best spectral representation but lacks a time domain representation of
the signal. To improve the time domain characteristics, the LSTM is combined with the DCNN, which
merges the frequency domain and time domain characteristics of the speech signal for DSR. The
DCNN-LSTM gives 88.20% accuracy for raw speech and 89.20% accuracy for the 39 MFCC coefficients. It
provides 91.5% and 93% accuracy for the 2-D representation of the speech signal using the CQT and MFCC
spectrograms, respectively.
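
The five-layer DCNN described above can be traced layer by layer to see how the feature maps shrink and how many trainable parameters each block holds. Only the filter counts (32, 64, 96, 128, 256) and the convolution, batch normalization, max pooling ordering come from the text. The 3x3 kernels, "same" padding with stride 1, 2x2 pooling, and the 128x128 single-channel spectrogram input are assumptions made for this sketch.

```python
FILTERS = (32, 64, 96, 128, 256)  # filter counts for layers 1 to 5, from the text


def trace_dcnn(height, width, channels=1, kernel=3, pool=2):
    """Return output shape and trainable parameter count for each DCNN block."""
    rows = []
    for out_ch in FILTERS:
        # Convolution weights plus one bias per filter ('same' padding keeps H and W).
        conv_params = kernel * kernel * channels * out_ch + out_ch
        # Batch normalization learns one scale and one shift per channel.
        bn_params = 2 * out_ch
        # Max pooling has no parameters and divides each spatial size by `pool`.
        height //= pool
        width //= pool
        rows.append({
            "filters": out_ch,
            "shape": (height, width, out_ch),
            "params": conv_params + bn_params,
        })
        channels = out_ch
    return rows


if __name__ == "__main__":
    for row in trace_dcnn(128, 128):
        print(row)
```

Under these assumptions the final block yields a 4x4x256 feature map, which would be flattened for a dense classifier or reshaped into a sequence when the DCNN is followed by an LSTM.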

Figure 5. Performance of DL based DSR


4. CONCLUSION 

Thus, this article presents DSR based on various ML and DL approaches, covering the methodology,
database, evaluation metrics, advantages, disadvantages, and findings from the study. The DL techniques
outperformed the traditional ML techniques because of their superior feature representation, and they are
less dependent on hand-crafted features than traditional ML based approaches. The experimental results
show that the DL based DSR schemes outperform the ML based DSR schemes and provide better feature
representation than traditional hand-crafted features. The performance of the DL framework is better for
the 2-D representation of the speech signal than for the 1-D signal because of its higher representation
capability in the spectral and spatial domains. Also, the combination of DCNN and LSTM, which has better
feature representation capability in the spectral and temporal domains, is superior to DNN, DCNN, and
LSTM alone. Database generation is a challenging task because of the unavailability of proper resources and
proper ground truth. DSR is very challenging due to the variability in speech intelligibility caused by
attributes such as language, age, gender, region, and noise.